Introduction to Open Data Science is a course organized by the Doctoral School in Humanities and Social Sciences (HYMY), University of Helsinki. As mentioned on the course page, “The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science”. The learning platform for the course is the University of Helsinki MOOC (Massive Open Online Course) platform: http://mooc.helsinki.fi. More information about the course can be found at this link: https://courses.helsinki.fi/en/hymy005/120776718.
In the first week of the course, I went through the general instructions and familiarized myself with the learning tools, the course content and the schedule. Although I had previous experience with R and RStudio, I worked through all the exercises and instructions in “DataCamp: R Short and Sweet” to refresh my R skills. I completed all of the “R Short and Sweet” exercises; my statement of accomplishment is published here. The other exercise I completed this first week is “RStudio Exercise 1”. I already had a GitHub account, so following the instructions for this exercise I forked the IODS-project repository from Tuomo Nieminen’s GitHub. After forking, I cloned the IODS-project repository to my local machine using the command git clone. Then, using RStudio, I edited the existing R Markdown files (chapter1.Rmd, chapter2.Rmd and index.Rmd) in the repository and also added a new R Markdown file saved as README.Rmd. Next, I committed the local changes with git commit -m and finally pushed them to GitHub with git push. I also changed the theme of my course diary web page to the Merlot theme. Through these exercises I refreshed my Git and GitHub knowledge and became familiar with how I will use GitHub during this course to share and publish my learning diary.
In this section, the data set available at this link has been preprocessed for further/downstream analysis. After creating a folder named ‘data’ in my IODS-project repository, I created a new R script named ‘create_learning2014’ in RStudio, wrote all the steps of the data wrangling process into it, and saved it in the ‘data’ folder. All the steps of the data wrangling can be found here and/or in the code shown below.
The data for this section was extracted from a survey on teaching and learning approaches conducted by Kimmo Vehkalahti. The research project was made possible by Teachers’ Academy funding (2013–2015). The data was collected between 3.12.2014 and 10.1.2015. The survey includes 183 respondents, and the questionnaire contains 60 variables/questions in total.
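As a rough sketch, the wrangling in create_learning2014 follows the lines below. The source URL and the exact question groupings are the ones used in the standard IODS exercise set; treat them as assumptions here rather than a verbatim copy of my script.

```r
# read the full survey data (tab-separated, with header)
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                    sep = "\t", header = TRUE)

# scale the combination variable Attitude back to the original 1-5 scale
lrn14$attitude <- lrn14$Attitude / 10

# average the questions belonging to each learning approach
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
                    "D06", "D15", "D23", "D31")
stra_questions <- c("ST01", "ST09", "ST17", "ST25", "ST04", "ST12", "ST20", "ST28")
surf_questions <- c("SU02", "SU10", "SU18", "SU26", "SU05", "SU13", "SU21", "SU29")

lrn14$deep <- rowMeans(lrn14[, deep_questions])
lrn14$stra <- rowMeans(lrn14[, stra_questions])
lrn14$surf <- rowMeans(lrn14[, surf_questions])

# keep only the analysis variables and drop students with zero exam points
keep <- c("gender", "Age", "attitude", "deep", "stra", "surf", "Points")
learning2014 <- subset(lrn14[, keep], Points > 0)

# save the cleaned data for later use
write.table(learning2014, "data/learning2014.txt", sep = "\t", row.names = FALSE)
```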
students2014=read.table("data/learning2014.txt", sep="\t", header=TRUE)
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(students2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
After preprocessing the raw data, the final data set has 7 columns (gender, Age, attitude, deep, stra, surf and Points) and 166 individuals, as shown in the output above.
plot_students2014=ggpairs(students2014, mapping = aes(col = gender,alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_students2014
The plot above shows that there are 110 female and 56 male respondents in the survey. Moreover, the plot shows how the variables (Points, Age, attitude, deep, stra and surf) are correlated. Based on the absolute correlation values, attitude and Points are the most highly correlated pair, while deep and Points are the least correlated.
The explanatory variables were chosen based on their (absolute) correlation with Points; the top three are attitude (0.437), stra (0.146) and surf (-0.144). Using these variables, the model is built with the R function lm().
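The selection step can be sketched as follows (assuming students2014 has been read in as above); this ranks the candidate predictors by the strength of their correlation with exam points:

```r
# rank the numeric variables by the (absolute) strength of their
# correlation with exam points; gender is excluded as a factor
num_vars <- students2014[, sapply(students2014, is.numeric)]
cors <- cor(num_vars)[, "Points"]
sort(abs(cors[names(cors) != "Points"]), decreasing = TRUE)
```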
my_model1 <- lm(Points ~ attitude + stra + surf, data = students2014)
summary(my_model1)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The significant parameters in the summary table above are the intercept (p = 0.00322) and attitude (p = 1.93e-08). Hence stra and surf are not considered in the model below; using the lm() function again, a linear regression model is built between Points and attitude. The model after removing the insignificant variables is summarized below. Looking at the adjusted R-squared, the value changed from 0.1927 (older model) to 0.1856 (newer model). However, the F-statistic (from 14.13 to 38.61) and its p-value (from 3.156e-08 to 4.119e-09) improved. Thus, the R-squared value alone may not determine the quality of the model, and a lower R-squared value might be due to outliers in the data.
my_model2 <- lm(Points ~ attitude , data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
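Since my_model2 is nested within my_model1, the decision to drop stra and surf can also be checked with an F-test; this is a complementary check rather than part of the original analysis:

```r
# F-test for the joint contribution of stra and surf
# (a non-significant p-value supports dropping them)
anova(my_model2, my_model1)
```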
par(mfrow = c(2,2))
plot(my_model2, which=c(1,2,5))
The three diagnostic plots used to validate the model are shown above. The assumptions checked are normality of the errors (Normal Q-Q plot), constant variance and independence of the errors (Residuals vs Fitted), and the absence of overly influential observations (Residuals vs Leverage).
Based on the plots above, we can conclude that the errors are approximately normally distributed (clearly seen in the Q-Q plot). Similarly, the Residuals vs Fitted plot shows that the errors do not depend on the fitted values. Moreover, in the Residuals vs Leverage plot, even the two points towards the right have only a minor influence on the model. The outliers are handled reasonably well, so the assumptions are more or less valid.
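As a numeric complement to the visual Q-Q check, residual normality can also be tested; this is an extra sketch, not part of the original exercise:

```r
# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value is consistent with the normality assumption
shapiro.test(residuals(my_model2))
```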
library(GGally)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(gmodels)
library(boot)
In this section, data wrangling has been done for two data sets retrieved from the UCI Machine Learning Repository. The R script used for the data wrangling can be found here.
The data sets used in this analysis were retrieved from the UCI Machine Learning Repository. They approach student achievement in secondary education at two Portuguese schools. The data were collected using school reports and questionnaires, and include attributes such as student grades and demographic, social and school-related features. Two data sets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por) (Cortez et al. 2008). For this analysis, the original data sets (mat and por) were modified and joined into one, keeping only students who answered the questionnaire in both classes. The final data set used in this analysis includes 382 observations and 35 attributes.
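A minimal sketch of the join step follows; the file paths and the exact set of identifying variables are assumptions based on the standard exercise, and the full script additionally combines the answer columns duplicated by the join.

```r
library(dplyr)

# read the two raw UCI files (semicolon-separated)
math <- read.table("data/student-mat.csv", sep = ";", header = TRUE)
por  <- read.table("data/student-por.csv", sep = ";", header = TRUE)

# keep only students present in both data sets, identified by
# their background variables
join_by <- c("school", "sex", "age", "address", "famsize", "Pstatus",
             "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
math_por <- inner_join(math, por, by = join_by, suffix = c(".math", ".por"))

# after combining the duplicated answer columns into a data frame 'alc',
# the alcohol use variables are derived like this:
# alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average weekday/weekend use
# alc$high_use <- alc$alc_use > 2            # TRUE = high consumption
```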
modified_data= read.table("data/modified_data.txt", sep="\t", header=TRUE)
colnames(modified_data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In order to study the relationship between high/low alcohol consumption and the other variables, I chose four variables (“absences”, “sex”, “goout” and “studytime”). My hypotheses about how each variable is related to alcohol consumption are shown below:
studytime: Students who dedicate much time to their studies do not have time to go out drinking alcohol (studytime and high/low alcohol consumption are negatively correlated).
sex: Male students have higher alcohol consumption than female students (being male and high/low alcohol consumption are positively correlated).
goout: Students who go out with friends frequently consume more alcohol than students who do not go out (goout and high/low alcohol consumption are positively correlated).
absences: Students who consume more alcohol miss classes for various reasons (e.g. hangovers, attending parties during class time) (absences and high/low alcohol consumption are positively correlated).
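Before any modelling, these hypotheses can be eyeballed with simple cross-tabulations; a quick sketch using base R (instead of the CrossTable() calls commented out further below):

```r
# counts of high alcohol use by sex
table(sex = modified_data$sex, high_use = modified_data$high_use)

# the same as row-wise proportions: the share of high users within each sex
prop.table(table(modified_data$sex, modified_data$high_use), margin = 1)
```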
A bar plot demonstrating the distributions of my chosen variables
my_variables= c("absences", "sex", "goout", "studytime", "high_use")
my_variables_data <- select(modified_data, one_of(my_variables))
colnames(my_variables_data)
## [1] "absences" "sex" "goout" "studytime" "high_use"
gather(my_variables_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
Box plots demonstrating my chosen variables and their relationships with alcohol consumption
g1 <- ggplot(modified_data, aes(x = high_use, y = absences,col=sex))
p1=g1 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g2 <- ggplot(modified_data, aes(x = high_use, y = goout, col=sex))
p2=g2 + geom_boxplot() + ylab("going out with friends") + ggtitle("Student who going out with friends by alcohol consumption and sex")
g3 <- ggplot(modified_data, aes(x = high_use, y = studytime, col=sex))
p3=g3 + geom_boxplot() + ylab("studytime - weekly study time") + ggtitle("Student who dedicate time to study by alcohol consumption and sex")
ggarrange(p1, p2, p3 , labels = c("A", "B","C"), ncol = 2, nrow = 2)
#sex_high_use <- CrossTable(my_variables_data$sex, my_variables_data$high_use)
#goout_high_use <- CrossTable(my_variables_data$goout, my_variables_data$high_use)
#studytime_high_use <- CrossTable(my_variables_data$studytime, my_variables_data$high_use)
To statistically explore the relationship between my chosen variables and the high/low alcohol consumption variable, a logistic regression model was built using the R function glm().
m <- glm(high_use ~ absences + sex + goout + studytime , data = modified_data, family = "binomial")
In order to interpret the fitted model in detail, the summary, coefficients and confidence intervals of the fitted model are calculated as shown below.
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout + studytime,
## family = "binomial", data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9103 -0.7892 -0.5021 0.7574 2.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.20483 0.59429 -5.393 6.94e-08 ***
## absences 0.07811 0.02244 3.481 0.000499 ***
## sexM 0.78173 0.26555 2.944 0.003242 **
## goout 0.72677 0.11994 6.059 1.37e-09 ***
## studytime -0.42116 0.17058 -2.469 0.013551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 381.31 on 377 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) absences sexM goout studytime
## -3.20483157 0.07810746 0.78173290 0.72676824 -0.42116184
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## absences 1.08123884 1.03577383 1.1324143
## sexM 2.18525582 1.30370562 3.7010763
## goout 2.06838526 1.64470314 2.6349716
## studytime 0.65628388 0.46510383 0.9099505
As shown above, all four variables are statistically significant. goout has the lowest p-value, suggesting a strong association between going out with friends and high/low alcohol consumption. The positive coefficient for the goout predictor suggests that, all other variables being equal, going out with friends increases alcohol consumption. In the logit model, the response variable is the log odds: \(\ln(\text{odds}) = \ln\big(p/(1-p)\big) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\). A unit increase in going out with friends increases the log odds by 0.72677. The variables absences and sex also have low p-values (0.000499 and 0.003242, respectively). The positive coefficient for absences suggests that, all other variables being equal, a unit increase in absences increases alcohol consumption; it raises the log odds by 0.07811. sex is likewise significant (p = 0.003242), suggesting an association between the sex of the student and high/low alcohol consumption; the positive coefficient indicates that, all other variables being equal, male students are more likely to be high alcohol consumers, with being male raising the log odds by 0.78173. studytime is the remaining significant variable, and its negative coefficient suggests that, all other variables being equal, a unit increase in studytime reduces alcohol consumption, lowering the log odds by 0.42116.
The odds are defined as the ratio of expected “successes” to “failures”: p/(1-p). Odds are an alternative way of expressing probabilities; an odds ratio (OR) greater than 1 corresponds to a higher probability of success. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs no change) in the corresponding explanatory variable. The exponent of the goout coefficient is 2.06838526, so a unit change in goout, all other variables being equal, corresponds to an odds ratio of 2.06838526. The exponent of the sexM coefficient is 2.18525582 (odds ratio 2.18525582), and in a similar way the exponent of the absences coefficient is 1.08123884 (odds ratio 1.08123884). The exponent of the studytime coefficient is 0.65628388 (odds ratio 0.65628388). Hence the odds ratios for the significant goout, sex and absences variables are 2.06838526, 2.1852558 and 1.08123884 respectively, meaning that goout, being male and absences increase the odds of high alcohol consumption by factors of around 2.07, 2.19 and 1.08. The odds ratio for studytime, 0.65628388, suggests that studytime decreases the odds of high alcohol consumption.
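The arithmetic behind these interpretations can be sketched directly, using goout as the example; eta below is an arbitrary value of the linear predictor chosen purely for illustration:

```r
b_goout <- coef(m)["goout"]   # log-odds coefficient, about 0.727
exp(b_goout)                  # odds ratio, about 2.07

# effect of one extra unit of goout on the predicted probability,
# holding the rest of the linear predictor fixed at eta
eta <- -1
plogis(eta)            # probability before
plogis(eta + b_goout)  # probability after
```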
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'modified_data'
modified_data <- mutate(modified_data, probability = probabilities)
# use the probabilities to make a prediction of high_use
modified_data <- mutate(modified_data, prediction = probability>0.5)
table(high_use = modified_data$high_use, prediction = modified_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 65 49
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(modified_data$high_use, modified_data$probability)
## [1] 0.2120419
The 2x2 cross-tabulation above indicates that, out of 382 observations, 81 (65 + 16) were wrongly predicted. The average proportion of incorrect predictions in the data is about 21%. This 21% training error is better than that of the model in the DataCamp exercises.
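The training error can be recomputed directly from the confusion matrix; the numbers below are the ones printed in the table above:

```r
# confusion matrix as printed above: rows = true high_use, cols = prediction
conf <- matrix(c(252, 65, 16, 49), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
wrong <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]  # 16 + 65 = 81
wrong / sum(conf)  # 81 / 382 = 0.2120419
```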
Bonus
# K-fold cross-validation
cv <- cv.glm(data = modified_data, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2120419
The 10-fold cross-validation result is 0.2172775. There is not much improvement from using 10-fold cross-validation: the prediction error is not obviously smaller than the one computed from the 2x2 cross-tabulation above (0.2120419).
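Cross-validation is most informative when comparing models; as an additional sketch (not part of the original analysis), the same loss function can be applied to a smaller model:

```r
# a reduced model with only the two strongest predictors
m_small <- glm(high_use ~ goout + sex, data = modified_data, family = "binomial")
cv_small <- cv.glm(data = modified_data, cost = loss_func, glmfit = m_small, K = 10)
cv_small$delta[1]  # CV error of the reduced model, to compare with the full one
```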
#install.packages("corrplot") # install the corrplot package
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(ggpubr)
library(magrittr)
library(gmodels)
library(boot)
library(knitr)
library("DT")
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The default installation of R comes with many (usually small) data sets. The data set we are dealing with in this week’s exercise, Boston, comes from the MASS package. The data were originally published by (Harrison et al. 1978) in a study of Boston house prices, and later also published by (Belsley et al. 1980). The Boston data set has 506 observations and 14 different variables. Details about the data set can be found from these two links: [1] and [2].
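Standardization is needed later in this section because LDA and k-means are distance-based methods and therefore sensitive to variable scales. scale() applies (x - mean(x)) / sd(x) column-wise, which can be verified by hand:

```r
library(MASS)
data("Boston")

b <- scale(Boston)
# manual standardization of the 'crim' column for comparison
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(b[, "crim"]), manual)  # TRUE
```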
Visualizing the scatter plot matrix and examining the correlations between the Boston variables
p1=ggpairs(Boston,title="scatter plot matrix with correlation")
p1 + theme(plot.title = element_text(size = rel(2)))
cor_matrix <- cor(Boston) %>% round(digits=2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2, title="Correlations plot", mar=c(0,0,1,0))
#kable(summary(Boston))
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled<-as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
pairs(boston_scaled[1:13], main = "Scaled scatter plot matrix",
pch = 21, bg = c("red", "green3", "blue","yellow")[boston_scaled$crime],
oma=c(4,4,6,12),upper.panel = NULL)
par(xpd=TRUE)
legend(0.85, 0.7, as.vector(unique(boston_scaled$crime)),
fill=c("red", "green3", "blue","yellow"))n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2574257 0.2475248 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.8062909 -0.8797351 -0.116404306 -0.8483255 0.3803591
## med_low -0.1046442 -0.2954616 0.030524797 -0.5618729 -0.1116838
## med_high -0.3740445 0.1554470 0.121380956 0.3278884 0.1573625
## high -0.4872402 1.0171737 0.006051757 1.0834474 -0.3800769
## age dis rad tax ptratio
## low -0.8567482 0.8625071 -0.6646214 -0.7143564 -0.36316233
## med_low -0.3452079 0.3605328 -0.5489875 -0.4853225 -0.05141138
## med_high 0.3790597 -0.3481769 -0.4087862 -0.3225233 -0.29170877
## high 0.8134548 -0.8497334 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.3805451 -0.73783216 0.462571952
## med_low 0.3171097 -0.12123625 -0.001057877
## med_high 0.1330148 -0.06788392 0.249121208
## high -0.8193198 0.84184742 -0.586864380
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.059023882 0.66820208 -0.96848469
## indus 0.008666569 -0.25782237 0.28513674
## chas -0.083933933 0.02401470 0.18573369
## nox 0.471743717 -0.80563966 -1.29943356
## rm -0.106194476 -0.07562733 -0.10607831
## age 0.241780124 -0.30911793 -0.11074253
## dis -0.019154318 -0.24517047 0.18744603
## rad 3.132764406 0.93431339 -0.09512087
## tax 0.006335134 0.04570825 0.56550330
## ptratio 0.137041450 0.02975160 -0.22492206
## black -0.147793203 0.04152248 0.16667936
## lstat 0.281891472 -0.19846040 0.43795982
## medv 0.261948118 -0.39543880 -0.21288019
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9552 0.0331 0.0117
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 1.3)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 8 0 0
## med_low 3 17 2 0
## med_high 0 8 17 1
## high 0 0 0 28
data('Boston')
boston_scaled2 <- scale(Boston)
dist_eu <- dist(boston_scaled2)
set.seed(12345)
k_max <- 13
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
#qplot(x = 1:k_max, y = twcss, geom =c("point", "line"),span = 0.2)
# k-means clustering
b <- 1:k_max
aa <- data.frame(cbind(b, twcss))
ggplot(data = aa, aes(x = b, y = twcss, group = 1)) +
  geom_line(color = "red") +
  geom_point()
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, main = "K-means clustering",col = km$cluster, upper.panel = NULL)